Abstract

With the recent proliferation of large-scale learning problems,
there has been a lot of interest in distributed machine learning
algorithms, particularly those based on stochastic gradient descent
(SGD) and its variants. However, existing algorithms either suffer
from slow convergence due to the inherent variance of stochastic
gradients, or have a fast linear convergence rate but at the expense
of poorer solution quality. In this paper, we combine their merits by
proposing a fast distributed asynchronous SGD-based algorithm with
variance reduction. A constant learning rate can be used, and the
algorithm is guaranteed to converge linearly to the optimal solution.
Experiments on the Google Cloud Computing Platform demonstrate that
the proposed algorithm outperforms state-of-the-art distributed
asynchronous algorithms in terms of both wall clock time and solution
quality.

Introduction 

With the recent proliferation of big data, learning the parameters
of a large machine learning model is a challenging problem. A popular
approach is to use stochastic gradient descent (SGD) and its variants
(Bottou 2010; Dean et al. 2012; Gimpel, Das, and Smith 2010). However,
it can still be difficult to store and process a big data set on one
single machine. Thus, there is now growing interest in distributed
machine learning algorithms (Dean et al. 2012; Gimpel, Das, and Smith
2010; Niu et al. 2011; Shamir, Srebro, and Zhang 2014; Zhang and Kwok
2014). The data set is partitioned into subsets, which are assigned to
multiple machines, and the optimization problem is solved in a
distributed manner.

In general, distributed architectures can be categorized as
shared-memory (Niu et al. 2011) or distributed-memory (Dean et al.
2012; Gimpel, Das, and Smith 2010; Shamir, Srebro, and Zhang 2014;
Zhang and Kwok 2014). In this paper, we will focus on the latter,
which is more scalable. Usually, one of the machines is the server,
while the rest are workers. The workers store the data subsets,
perform local computations and send their updates to the server. The
server then aggregates the local information, performs the actual
update on the model parameter, and sends it back to the workers. Note
that workers only need to communicate with the server, not with one
another. Such a distributed computing model has been commonly used in
many recent large-scale machine learning implementations (Dean et al.
2012; Gimpel, Das, and Smith 2010; Shamir, Srebro, and Zhang 2014;
Zhang and Kwok 2014).

Often, machines in these systems have to run synchronously (Boyd
et al. 2011; Shamir, Srebro, and Zhang 2014). In each iteration,
information from all workers needs to be ready before the server can
aggregate the updates. This can be expensive due to communication
overhead and random network delay. It also suffers from the straggler
problem (Albrecht et al. 2006), in which the system can move forward
only at the pace of the slowest worker.

To alleviate these problems, asynchronicity is introduced (Agarwal
and Duchi 2011; Dean et al. 2012; Ho et al. 2013; Li et al. 2014;
Zhang and Kwok 2014). The server is allowed to use stale (delayed)
information from the workers, and thus only needs to wait for a much
smaller number of workers in each iteration. Promising theoretical
and empirical results have been reported. One prominent example of
asynchronous SGD is the downpour SGD (Dean et al. 2012). Each worker
independently reads the parameter from the server, computes the local
gradient, and sends it back to the server. The server then immediately
updates the parameter using the worker’s gradient information. Using
an adaptive learning rate (Duchi, Hazan, and Singer 2011), downpour
SGD achieves state-of-the-art performance.

However, in order for these algorithms to converge, the learning
rate has to decrease not only with the number of iterations (as in
standard single-machine SGD algorithms (Bottou 2010)), but also with
the maximum delay $\tau$ (i.e., the duration between the time the
gradient is computed by the worker and the time it is used by the
server) (Ho et al. 2013). On the other hand, downpour SGD does not
constrain $\tau$, but no convergence guarantee is provided for it.

In practice, a decreasing learning rate leads to slower convergence
(Bottou 2010; Johnson and Zhang 2013). Recently, Feyzmahdavian,
Aytekin, and Johansson (2014) proposed the delayed proximal gradient
method, in which the delayed gradient is used to update an analogously
delayed model parameter (rather than the current one). It is shown
that even with a constant learning rate, the algorithm converges
linearly to within $\epsilon$ of the optimal solution. However, to
achieve a small $\epsilon$, the learning rate needs to be small, which
again means slow convergence.

Recently, variance reduction techniques for SGD have flourished.
Examples include stochastic average gradient (SAG) (Roux, Schmidt,
and Bach 2012), stochastic variance reduced gradient (SVRG) (Johnson
and Zhang 2013), minimization by incremental surrogate optimization
(MISO) (Mairal 2013; 2015), SAGA (Defazio, Bach, and Lacoste-Julien
2014), stochastic dual coordinate ascent (SDCA) (Shalev-Shwartz and
Zhang 2013), and Proximal SVRG (Xiao and Zhang 2014). The idea is to
use past gradients to progressively reduce the stochastic gradient’s
variance, so that a constant learning rate can again be used. When the
optimization objective is strongly convex and Lipschitz-smooth, all
these variance-reduced SGD algorithms converge linearly to the optimal
solution. However, their space requirements are different. In
particular, SVRG is advantageous in that it only needs to store the
averaged sample gradient, while SAGA and SAG have to store the most
recent gradient of every sample. Recently, Mania et al. (2015) and
Reddi et al. (2015) extended SVRG to the parallel asynchronous
setting. Their algorithms are designed for shared-memory multi-core
systems, and assume that the data samples are sparse. However, in a
distributed computing environment, the samples need to be mini-batched
to reduce the communication overhead between workers and server. Even
when the samples are sparse, the resultant mini-batch typically is
not.

In this paper, we propose a distributed asynchronous 
SGD-based algorithm with variance reduction, and the data 
samples can be sparse or dense. The algorithm is easy to 
implement, highly scalable, uses a constant learning rate, 
and converges linearly to the optimal solution. A prototype 
is implemented on the Google Cloud Computing Platform. 
Experiments on several big data sets from the Pascal Large 
Scale Learning Challenge and LibSVM archive demonstrate 
that it outperforms the state-of-the-art. 

The rest of the paper is organized as follows. We first introduce
related work. Next, we present the proposed distributed asynchronous
algorithm. This is then followed by experimental results, including
comparisons with the state-of-the-art distributed asynchronous
algorithms, and the last section gives concluding remarks.

Related Work 

Consider the following optimization problem 

$$\min_w F(w) \equiv \frac{1}{N} \sum_{i=1}^{N} f_i(w). \qquad (1)$$

In many machine learning applications, $w \in \mathbb{R}^d$ is the
model parameter, N is the number of training samples, and each
$f_i : \mathbb{R}^d \to \mathbb{R}$ is the loss (possibly regularized)
due to sample i. The following assumptions are commonly made.

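Assumption 1 Each $f_i$ is $L_i$-smooth (Nesterov 2004), i.e.,
$f_i(x) \le f_i(y) + \langle \nabla f_i(y), x - y \rangle + \frac{L_i}{2} \|x - y\|^2 \;\; \forall x, y$.

Assumption 2 F is $\mu$-strongly convex (Nesterov 2004), i.e.,
$F(x) \ge F(y) + \langle \nabla F(y), x - y \rangle + \frac{\mu}{2} \|x - y\|^2 \;\; \forall x, y$.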
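As an illustration of the two assumptions (this example is not part of the analysis in this paper), consider an $\ell_2$-regularized least-squares loss, where $\lambda > 0$ is a hypothetical regularization weight. Both constants can be read off the Hessian:

```latex
% Per-sample loss and its Hessian
f_i(w) = \tfrac{1}{2}\,(x_i^\top w - y_i)^2 + \tfrac{\lambda}{2}\,\|w\|^2,
\qquad
\nabla^2 f_i(w) = x_i x_i^\top + \lambda I .

% Assumption 1: the largest Hessian eigenvalue gives the smoothness constant
L_i = \|x_i\|^2 + \lambda .

% Assumption 2: every f_i is \lambda-strongly convex, hence so is the average F
\mu \ge \lambda .
```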


Delayed Proximal Gradient (DPG) 

At iteration t of the DPG (Feyzmahdavian, Aytekin, and Johansson
2014), a worker uses $w^{t-\tau_t}$, the copy of w delayed by
$\tau_t$ iterations, to compute the stochastic gradient
$\nabla f_i(w^{t-\tau_t})$ on a random sample i. The delayed gradient
is used to update the correspondingly delayed parameter copy
$w^{t-\tau_t}$ to
$\hat{w}^{t-\tau_t} = w^{t-\tau_t} - \eta \nabla f_i(w^{t-\tau_t})$,
where $\eta$ is a constant learning rate. This $\hat{w}^{t-\tau_t}$ is
then sent to the server, which obtains the new iterate $w^{t+1}$ as a
convex combination of the current $w^t$ and $\hat{w}^{t-\tau_t}$:

$$w^{t+1} = (1 - \theta) w^t + \theta \hat{w}^{t-\tau_t}, \quad \theta \in (0, 1]. \qquad (2)$$

It can be shown that the sequence $\{w^t\}$ converges linearly to the
optimal solution $w^*$, but only within a tolerance of $\epsilon$,
i.e.,

$$E[F(w^t) - F(w^*)] \le \rho^t (F(w^0) - F(w^*)) + \epsilon,$$

for some $\rho < 1$ and $\epsilon > 0$. The tolerance $\epsilon$ can
be reduced by reducing $\eta$, though at the expense of increasing
$\rho$ and thus slowing down convergence. Moreover, though the
learning rate of DPG is typically larger than that of SGD, the
gradient of DPG (i.e., $\nabla f_i(w^{t-\tau_t})$) is delayed, which
slows convergence.

Stochastic Variance Reduced Gradient 

SGD, though simple and scalable, has a slower convergence rate than
batch gradient descent (Mairal 2013). As noted in (Johnson and Zhang
2013), the underlying reason is that the stepsize of SGD has to be
decreasing so as to control the gradient’s variance. Recently, by
observing that the training set is always finite in practice, a number
of techniques have been developed that reduce this variance and thus
allow the use of a constant stepsize (Defazio, Bach, and
Lacoste-Julien 2014; Johnson and Zhang 2013; Mairal 2013; Roux,
Schmidt, and Bach 2012; Xiao and Zhang 2014).

In this paper, we focus on one of the most popular techniques in this
family, namely the stochastic variance reduced gradient (SVRG)
(Johnson and Zhang 2013) (Algorithm 1). It is advantageous in that no
extra space is needed for the intermediate gradients or dual
variables. The algorithm proceeds in stages. At the beginning of each
stage, the gradient
$\nabla F(\tilde{w}) = \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(\tilde{w})$
is computed on the whole data set using a past parameter estimate
$\tilde{w}$ (which is updated across stages). For each subsequent
iteration t in this stage, the approximate gradient

$$\hat{\nabla} f_i(w^t) = \nabla f_i(w^t) - \nabla f_i(\tilde{w}) + \nabla F(\tilde{w})$$

is used, where i is a sample randomly selected from
$\{1, 2, \ldots, N\}$. Even with a constant learning rate $\eta$, the
(expected) variance of $\hat{\nabla} f_i(w^t)$ goes to zero
progressively, and the algorithm achieves linear convergence.

In contrast to DPG, SVRG can converge to the optimal solution.
However, though SVRG has been extended to the parallel asynchronous
setting on shared-memory multi-core systems (Reddi et al. 2015; Mania
et al. 2015), its use and convergence properties in a distributed
asynchronous learning setting remain unexplored.



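Algorithm 1 Stochastic variance reduced gradient (SVRG)
(Johnson and Zhang 2013).

1: Initialize $\tilde{w}^0$;
2: for s = 1, 2, ... do
3:   $\tilde{w} = \tilde{w}^{s-1}$;
4:   $\nabla F(\tilde{w}) = \frac{1}{N} \sum_{i=1}^{N} \nabla f_i(\tilde{w})$;
5:   $w^0 = \tilde{w}$;
6:   for t = 0, 1, ..., m − 1 do
7:     randomly pick $i \in \{1, \ldots, N\}$;
8:     $w^{t+1} = w^t - \eta \hat{\nabla} f_i(w^t)$;
9:   end for
10:  set $\tilde{w}^s = w^t$ for a randomly chosen $t \in \{0, \ldots, m - 1\}$;
11: end for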
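The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration on a scalar least-squares toy problem, not the implementation used in this paper; the function name `svrg`, the data, and the values of `eta`, `m` and `stages` are assumptions chosen only so that the example runs quickly.

```python
import random


def svrg(xs, ys, eta, m, stages, seed=0):
    """SVRG on F(w) = (1/N) * sum_i 0.5 * (w * x_i - y_i)**2, scalar w."""
    rng = random.Random(seed)
    n = len(xs)

    def grad(w, i):
        # Gradient of the i-th loss 0.5 * (w * x_i - y_i)**2.
        return (w * xs[i] - ys[i]) * xs[i]

    w_tilde = 0.0                                   # line 1
    for _ in range(stages):                         # line 2
        # line 4: full gradient at the snapshot w_tilde
        full_grad = sum(grad(w_tilde, i) for i in range(n)) / n
        w = w_tilde                                 # line 5
        iterates = []
        for _ in range(m):                          # line 6
            iterates.append(w)                      # remember w^t
            i = rng.randrange(n)                    # line 7
            # variance-reduced gradient of sample i at w^t
            vr_grad = grad(w, i) - grad(w_tilde, i) + full_grad
            w = w - eta * vr_grad                   # line 8
        w_tilde = rng.choice(iterates)              # line 10
    return w_tilde


xs = [1.0, 1.2, 0.8, 1.1]
ys = [2.0 * x for x in xs]   # noiseless data, so the minimizer is w = 2
w_final = svrg(xs, ys, eta=0.5, m=20, stages=60)
```

Note that the learning rate stays fixed throughout; the correction term `- grad(w_tilde, i) + full_grad` is what removes the need for a decaying stepsize.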



Proposed Algorithm 

In this section, we consider the distributed asynchronous setting,
and propose a hybrid of an improved DPG algorithm and SVRG that
inherits the advantages of both. Similar to the two algorithms, it
also uses a constant learning rate (which is typically larger than
the one used by SGD), but with guaranteed linear convergence to the
optimal solution.

Update using Delayed Gradient 

We replace the SVRG update (line 8 in Algorithm 1) by

$$w^{t+1} = (1 - \theta) v^t + \theta \hat{w}^{t-\tau_t}, \qquad (3)$$

where

$$v^t = w^t - \eta \hat{\nabla} f_i(w^{t-\tau_t})$$

and

$$\hat{w}^{t-\tau_t} = w^{t-\tau_t} - \eta \hat{\nabla} f_i(w^{t-\tau_t}). \qquad (4)$$

Obviously, when $\tau_t = 0$, (3) reduces to standard SVRG. Note that
both the parameter and gradient in $\hat{w}^{t-\tau_t}$ are for the
same iteration ($t - \tau_t$), while $v^t$ is noisy as the gradient is
delayed (by $\tau_t$). This delayed gradient cannot be too old. Thus,
similar to (Ho et al. 2013), we impose the bounded delay condition
that $\tau_t \le \tau$ for some $\tau > 0$. This $\tau$ parameter
determines the maximum duration between the time the gradient is
computed and the time it is used. A larger $\tau$ allows more
asynchronicity, but also adds noise to the gradient and thus may slow
convergence.

Update (3) is similar to (2) in DPG, but with two important
differences. First, the gradient $\nabla f_i(w^{t-\tau_t})$ in DPG is
replaced by its variance-reduced counterpart
$\hat{\nabla} f_i(w^{t-\tau_t})$. As will be seen, this allows
convergence to the optimal solution using a constant learning rate.
The second difference is that the delayed gradient
$\hat{\nabla} f_i(w^{t-\tau_t})$ is used not only on the past iterate
$w^{t-\tau_t}$, but also on the current iterate $w^t$. This can
potentially yield faster progress, as is most apparent when
$\theta = 0$. In this special case, DPG reduces to $w^{t+1} = w^t$,
and makes no progress; while (3) reduces to the asynchronous SVRG
update in (Reddi et al. 2015).
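The contrast between updates (2) and (3) can be made concrete with a small sketch. The scalar functions below are illustrative only; their names and the numeric values are assumptions, and `g` stands for a gradient that has already been evaluated at the delayed iterate.

```python
def delayed_vr_step(w_t, w_delayed, g, eta, theta):
    """Update (3): g is the variance-reduced gradient at the delayed iterate."""
    v_t = w_t - eta * g              # current iterate moved by the delayed gradient
    w_hat = w_delayed - eta * g      # (4): delayed iterate moved by the same gradient
    return (1.0 - theta) * v_t + theta * w_hat


def dpg_step(w_t, w_delayed, g, eta, theta):
    """Update (2), for comparison: the current iterate itself is not moved."""
    w_hat = w_delayed - eta * g
    return (1.0 - theta) * w_t + theta * w_hat


# No delay (w_delayed == w_t): (3) is a plain step w - eta * g for any theta.
no_delay = delayed_vr_step(1.0, 1.0, 0.5, eta=0.1, theta=0.3)

# theta = 0: DPG stays at w_t, whereas (3) still takes a gradient step.
dpg_stuck = dpg_step(1.0, 0.7, 0.5, eta=0.1, theta=0.0)
vr_moves = delayed_vr_step(1.0, 0.7, 0.5, eta=0.1, theta=0.0)
```

The only difference between the two functions is whether the first term of the convex combination uses `w_t` or `v_t`.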

Mini-Batch 

In a distributed algorithm, communication overhead is incurred when a
worker pulls parameters from the server or pushes updates to it. In a
distributed SGD-based algorithm, the communication cost is
proportional to the number of gradient evaluations made by the
workers. Similar to the other SGD-based distributed algorithms
(Gimpel, Das, and Smith 2010; Dean et al. 2012; Ho et al. 2013), this
cost can be reduced by the use of a mini-batch. Instead of pulling
parameters from the server after every sample, the worker pulls only
after processing each mini-batch of size B.

Distributed Implementation 

There is a scheduler, a server and P workers. The server keeps a
clock (denoted by an integer t), the most updated copy of parameter w,
a past parameter estimate $\tilde{w}$ and the corresponding full
gradient $\nabla F(\tilde{w})$ evaluated on the whole training set
$\mathcal{D}$ (with N samples). We divide $\mathcal{D}$ into P
disjoint subsets
$\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_P$, where
$\mathcal{D}_p$ is owned by worker p. The number of samples in
$\mathcal{D}_p$ is denoted $n_p$. Each worker p also keeps a local
copy $w_p$ of $\tilde{w}$.

In the following, a task refers to an event timestamped by the
scheduler. It can be issued by the scheduler or a worker, and received
by either the server or workers. Each worker can only process one task
at a time. There are two types of tasks, update tasks and evaluation
tasks, which will be discussed in more detail in the sequel. A worker
may pull the parameter from the server by sending a request, which
carries the type and timestamp of the task being run by the worker.

Scheduler The scheduler (Algorithm 2) runs in stages. In each stage,
it first issues m update tasks to the workers, where m is usually a
multiple of $\lceil N/B \rceil$ as in SVRG. After spawning enough
tasks, the scheduler measures the progress by issuing an evaluation
task to the server and all workers. As will be seen, the server
ensures that evaluation is carried out only after all update tasks for
the current stage have finished. If the stopping condition is met, the
scheduler informs the server and all workers by issuing a STOP
command; otherwise, it moves to the next stage and sends more update
tasks.


Algorithm 2 Scheduler.

1: for s = 1, ..., S do
2:   for k = 1, ..., m do
3:     pick worker p with probability $q_p = n_p / N$;
4:     issue an update task to the worker with timestamp $t = (s - 1)m + k$;
5:   end for
6:   issue an evaluation task (with timestamp $t = sm + 1$) to workers and server;
7:   wait and collect progress information from workers;
8:   if progress meets stopping condition then
9:     issue a STOP command to the workers and server;
10:  end if
11: end for
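The task stream produced by Algorithm 2 can be sketched as a Python generator. This is an illustration only: the function name `schedule` and the tuple layout are hypothetical, the stopping test is omitted, and workers are assumed to be picked with probability proportional to their local data set size.

```python
import random


def schedule(shard_sizes, m, stages, seed=0):
    """Yield (kind, worker, timestamp) in the order Algorithm 2 issues tasks."""
    rng = random.Random(seed)
    workers = list(range(len(shard_sizes)))
    for s in range(1, stages + 1):
        for k in range(1, m + 1):
            # Assumption: worker p is drawn with weight n_p (its shard size).
            p = rng.choices(workers, weights=shard_sizes)[0]
            yield ("update", p, (s - 1) * m + k)
        # One evaluation task per stage, addressed to the server and all workers.
        yield ("evaluate", None, s * m + 1)


tasks = list(schedule([10, 30], m=3, stages=2))
update_stamps = [t for kind, _, t in tasks if kind == "update"]
eval_stamps = [t for kind, _, t in tasks if kind == "evaluate"]
```

With `m=3` and two stages, the update tasks carry timestamps 1 to 6 and the two evaluation tasks carry timestamps 4 and 7.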


Worker At stage s, when worker p receives an update task with
timestamp t, it sends a parameter pull request to the server. The
server will not respond to this request until it has finished all
tasks with timestamps before $t - \tau$.






Let $w_{p,t}$ be the parameter value pulled. Worker p selects a
mini-batch (a subset of $\mathcal{D}_p$, of size B) randomly from its
local data set. Analogous to $\hat{w}^{t-\tau_t}$ in (4), it computes

$$\hat{w}_{p,t} = w_{p,t} - \eta \Delta w_{p,t}, \qquad (5)$$

where $\Delta w_{p,t}$ is the mini-batch gradient evaluated at
$w_{p,t}$. An update task is then issued to push $\hat{w}_{p,t}$ and
$\Delta w_{p,t}$ to the server.

When a worker receives an evaluation task, it again sends a parameter
pull request to the server. As will be seen in the following section,
the pulled $w_{p,t}$ will always be the latest w kept by the server in
the current stage. Hence, the $w_{p,t}$’s pulled by all workers are
the same. Worker p then updates $w_p$ as $w_p = w_{p,t}$, and computes
and pushes the corresponding gradient

$$\nabla F_p(w_p) = \frac{1}{n_p} \sum_{i \in \mathcal{D}_p} \nabla f_i(w_p)$$

to the server. To inform the scheduler of its progress, worker p also
computes its contribution to the optimization objective,
$\sum_{i \in \mathcal{D}_p} f_i(w_p)$, and pushes it to the scheduler.
The whole worker procedure is shown in Algorithm 3.


Algorithm 3 Worker p receiving an update/evaluation task t at stage s.

1: send a parameter pull request to the server;
2: wait for response from the server;
3: if task t is an update task then
4:   pick a mini-batch subset randomly from the local data set;
5:   compute mini-batch gradient $\Delta w_{p,t}$ and $\hat{w}_{p,t}$ using (5), and push them to the server as an update task;
6: else
7:   set $w_p = w_{p,t}$; {task t is an evaluation task}
8:   push the local subset gradient $\nabla F_p(w_p)$ to the server as an update task;
9:   push the local objective value to the scheduler;
10: end if


Server There are two threads running on the server. One is a daemon
thread that responds to parameter pull requests from workers
(Algorithm 4); the other is a computing thread for handling update
tasks from workers and evaluation tasks from the scheduler
(Algorithm 5).

When the daemon thread receives a parameter pull request, it reads
the type and timestamp t within. If the request is from a worker
running an update task, it checks whether all update tasks before
$t - \tau$ have finished. If not, the request remains in the buffer;
otherwise, it pushes its w value to the requesting worker. This
controls the allowed asynchronicity. On the other hand, if the request
is from a worker executing an evaluation task, the daemon thread does
not push w to the workers until all update tasks before t have
finished. This ensures that the w pulled by the worker is the most
up-to-date for the current stage.

When the computing thread receives an update task (with timestamp t)
from worker p, the $\hat{w}_{p,t}$ and $\Delta w_{p,t}$ contained


Algorithm 4 Daemon thread of the server.

1: repeat
2:   if pull request buffer is not empty then
3:     for each request with timestamp t in the buffer do
4:       if request is triggered by an update task then
5:         if all update tasks before $t - \tau$ have finished then
6:           push w to the requesting worker;
7:           remove request from buffer;
8:         end if
9:       else
10:        if all update tasks before t have finished then
11:          push w to the requesting worker;
12:          remove request from buffer;
13:        end if
14:      end if
15:    end for
16:  else
17:    sleep for a while;
18:  end if
19: until STOP command is received.
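The release rule applied by the daemon thread can be sketched as a pure function. This is an illustrative sketch, not the system's code: the names `can_serve` and `sweep` are hypothetical, and update tasks are assumed to carry consecutive integer timestamps starting from 1, as issued by the scheduler.

```python
def can_serve(kind, t, finished, tau):
    """Return True if a pull request with timestamp t may be answered.

    finished is the set of timestamps of update tasks already applied.
    """
    # Update requests tolerate a delay of tau; evaluation requests tolerate none.
    horizon = t - tau if kind == "update" else t
    # All update tasks with timestamp strictly before the horizon must be done.
    return all(u in finished for u in range(1, horizon))


def sweep(buffer, finished, tau):
    """One pass over the request buffer (lines 3-15 of Algorithm 4)."""
    served = [r for r in buffer if can_serve(r[0], r[1], finished, tau)]
    pending = [r for r in buffer if r not in served]
    return served, pending


finished = {1, 2, 3}
buffer = [("update", 6), ("update", 7), ("evaluate", 4), ("evaluate", 5)]
served, pending = sweep(buffer, finished, tau=2)
```

Here the update request with timestamp 7 stays in the buffer because update task 4 (which lies before 7 − 2) has not finished yet.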


Algorithm 5 Computing thread of the server.

1: repeat
2:   wait for tasks;
3:   if an update task is received then
4:     update w using (6), and mark this task as finished;
5:   else
6:     wait for all update tasks to finish;
7:     set $\tilde{w} = w$;
8:     collect local full gradients from workers and update $\nabla F(\tilde{w})$;
9:     broadcast $\nabla F(\tilde{w})$ to all workers;
10:  end if
11: until STOP command is received.


inside are read. Analogous to (3), the server updates w as

$$w = (1 - \theta)(w - \eta \Delta w_{p,t}) + \theta \hat{w}_{p,t}, \qquad (6)$$

and marks this task as finished. During the update, the computing
thread locks w so that the daemon thread cannot access it until the
update is finished.
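A minimal sketch of update (6) with the locking described above is given below. The class and method names are hypothetical, and a plain Python list with `threading.Lock` stands in for whatever data structures the actual system uses.

```python
import threading


class ParameterServer:
    """Holds w and applies update (6) under a lock."""

    def __init__(self, w0, eta, theta):
        self.w = list(w0)
        self.eta = eta
        self.theta = theta
        self.finished = set()          # timestamps of applied update tasks
        self._lock = threading.Lock()

    def apply_update(self, t, w_hat, delta):
        # Computing thread: w <- (1 - theta) * (w - eta * delta) + theta * w_hat.
        with self._lock:
            self.w = [
                (1.0 - self.theta) * (w - self.eta * d) + self.theta * wh
                for w, d, wh in zip(self.w, delta, w_hat)
            ]
            self.finished.add(t)

    def read(self):
        # Daemon thread: blocks while an update is in progress.
        with self._lock:
            return list(self.w)


server = ParameterServer([1.0, 2.0], eta=0.1, theta=0.5)
server.apply_update(1, w_hat=[0.0, 0.0], delta=[1.0, -1.0])
w_now = server.read()
```

Returning a copy from `read` keeps a pulled parameter from changing underneath the worker that requested it.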

When the server receives an evaluation task, it synchronizes all
workers, and sets $\tilde{w} = w$. As all $w_p$’s are the same and
equal to $\tilde{w}$, one can simply aggregate the local gradients to
obtain $\nabla F(\tilde{w}) = \sum_{p=1}^{P} q_p \nabla F_p(w_p)$,
where $q_p = n_p / N$. The server then broadcasts
$\nabla F(\tilde{w})$ to all workers.
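The aggregation step can be checked numerically. The sketch below assumes the weights are the shard fractions $n_p / N$, which is the choice that makes the weighted sum of local averages equal to the full-data average; the function names and the toy data are hypothetical.

```python
def local_gradient(grad, shard, w):
    """Average gradient over one worker's shard (list of sample indices)."""
    return sum(grad(w, i) for i in shard) / len(shard)


def aggregate(local_grads, shard_sizes):
    """Weighted sum of local gradients with weights n_p / N."""
    total = sum(shard_sizes)
    return sum(n_p / total * g for n_p, g in zip(shard_sizes, local_grads))


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x for x in xs]


def grad(w, i):
    # Gradient of 0.5 * (w * x_i - y_i)**2 for a scalar w.
    return (w * xs[i] - ys[i]) * xs[i]


shards = [[0, 1], [2, 3, 4]]          # two workers with unequal shard sizes
w = 0.0
local_grads = [local_gradient(grad, shard, w) for shard in shards]
combined = aggregate(local_grads, [len(shard) for shard in shards])
direct = sum(grad(w, i) for i in range(len(xs))) / len(xs)
```

A plain unweighted mean of the local gradients would differ from `direct` here, because the two shards have different sizes.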

Discussion 

Two state-of-the-art distributed asynchronous SGD algorithms are the
downpour SGD (Dean et al. 2012) and Petuum SGD (Ho et al. 2013; Dai
et al. 2013). Downpour SGD does not impose the bounded delay condition
(essentially, $\tau = \infty$), while Petuum SGD does. Note that there
is a subtle difference between the bounded delay condition of the
proposed algorithm and that of Petuum SGD. In Petuum SGD, the amount
of staleness is measured between workers, namely that the slowest and
fastest workers must be less than s timesteps apart (where s is the
staleness parameter). Consequently, the delay in the gradient is
always a multiple of P and is upper-bounded by sP. On the other hand,
in the proposed algorithm, the bounded delay condition is imposed on
the update tasks. It can be easily seen that $\tau$ is also the
maximum delay in the gradient. Thus, $\tau$ can be any number, and is
not necessarily a multiple of P.


The Mnist8m and DNA data sets (Table 1) are from the LibSVM archive¹
and the Pascal Large Scale Learning Challenge².



           #samples      #features   #classes
Mnist8m    8,100,000     784         10
DNA        50,000,000    800         2

Table 1: Summary of the data sets used. 


Convergence Analysis 

For simplicity of analysis, we assume that the mini-batch size is
one. Let $\tilde{w}^S$ be the $\tilde{w}$ stored in the server at the
end of stage S (step 7 in Algorithm 5). The following theorem shows
linear convergence of the proposed algorithm. It is the first such
result for distributed asynchronous SGD-based algorithms with a
constant learning rate.


Using the Google Cloud Computing Platform³, we set up a cluster with
18 computing nodes. Each node is a Google Cloud n1-highmem-8 instance
with eight cores and 52GB memory. The scheduler and the server each
take one instance, while each worker takes a core. Thus, we have a
maximum of 128 workers. The system is implemented in C++, with the
ZeroMQ package for communication.


Theorem 3 Suppose that problem (1) satisfies Assumptions 1 and 2. Let
$L = \max\{L_i\}_{i=1}^{N}$, and $\gamma$ =

(1 - 2r]{p- +60^- WithpG ( 0 , ^)andm 

sufficiently large such that $\gamma < 1$. Assume that the scheduler
has been run for S stages. We then obtain the following linear rate:

$$E[F(\tilde{w}^S) - F(w^*)] \le \gamma^S [F(\tilde{w}^0) - F(w^*)].$$

As L > /i, it is easy to see that 1 — 2p(/i — and 

01 ^-riL^ smaller than 1. Thus, 7 < 1 can be guaran¬ 

teed for a sufficiently large m. Moreover, as F is strongly 
convex, the following corollary shows that $\tilde{w}^S$ also
converges to $w^*$. In contrast, DPG only converges to within a
tolerance of $\epsilon$.

Corollary 4 $E\|\tilde{w}^S - w^*\|^2 \le 2\gamma^S [F(\tilde{w}^0) - F(w^*)] / \mu$.
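The step from the linear rate on the objective to Corollary 4 is short. A sketch of the argument, using only Assumption 2 and the fact that the gradient of F vanishes at the minimizer, is:

```latex
% Assumption 2 with x = \tilde{w}^S, y = w^*, and \nabla F(w^*) = 0:
F(\tilde{w}^S) - F(w^*) \;\ge\; \frac{\mu}{2}\,\|\tilde{w}^S - w^*\|^2 .

% Take expectations and apply the linear rate on the objective:
E\|\tilde{w}^S - w^*\|^2
  \;\le\; \frac{2}{\mu}\, E\big[F(\tilde{w}^S) - F(w^*)\big]
  \;\le\; \frac{2\gamma^S}{\mu}\,\big[F(\tilde{w}^0) - F(w^*)\big] .
```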

When $\tau < P$, the server can serve at most $\tau$ workers
simultaneously. For maximum parallelism, $\tau$ should increase with
P. However, $\gamma$ also increases with $\tau$. Thus, a larger m
and/or S may be needed to achieve the same solution quality.

Similar to DPG, our learning rate does not depend on the delay
$\tau$ and the number of workers P. This learning rate can be
significantly larger than the one in Petuum SGD (Ho et al. 2013),
which has to be decayed and is proportional to $O(1/\sqrt{P})$. Thus,
the proposed algorithm can be much faster, as will be confirmed in the
experiments. While our bound may be loose due to the use of worst-case
analysis, linear convergence is always guaranteed for any
$\theta \in (0, 1]$.


Experiments 

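In this section, we consider the K-class logistic regression
problem:

$$\min_{\{w_k\}_{k=1}^{K}} \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} -I(y_i = k) \log \left( \frac{\exp(w_k^T x_i)}{\sum_{j=1}^{K} \exp(w_j^T x_i)} \right),$$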

Comparison with the State-of-the-Art 

In this section, the following distributed asynchronous algo¬ 
rithms are compared: 

• Downpour SGD (“downpour-sgd”) (Dean et al. 2012), with the adaptive
learning rate in Adagrad (Duchi, Hazan, and Singer 2011);

• Petuum SGD (“petuum-sgd”) (Dai et al. 2013), the state-of-the-art
implementation of asynchronous SGD. The learning rate is reduced by a
fixed factor of 0.95 at the end of each epoch. The staleness s is set
to 2, and so the delay in the gradient is bounded by 2P;

• DPG (“dpg”) (Feyzmahdavian, Aytekin, and Johansson 2014);

• A variant of DPG (“vr-dpg”), in which the gradient in update (2) is
replaced by its variance-reduced version;

• The proposed “distributed variance-reduced stochastic gradient
descent” (“distr-vr-sgd”) algorithm;

• A special case of distr-vr-sgd, with $\theta = 0$ (denoted
“distr-svrg”). This reduces to the asynchronous SVRG algorithm in
(Reddi et al. 2015).

We use 128 workers. To maximize parallelism, we fix $\tau$ to 128.
The Petuum SGD code is downloaded from http://petuum.github.io/,
while the other asynchronous algorithms are implemented in C++ by
reusing most of our system’s code. Preliminary studies show that
synchronous SVRG is much slower, and so it is not included in the
comparison. For distr-vr-sgd and distr-svrg, the number of stages is
S = 50, and the number of iterations in each stage is
$m = \lceil N/B \rceil$, where B is about 10% of each worker’s local
data set size. For fair comparison, the other algorithms are run for
mS iterations. All other parameters are tuned on a validation set,
which is 1% of the data set.

Figure 1 shows convergence of the objective w.r.t. wall clock time.
As can be seen, distr-vr-sgd outperforms all the other algorithms.
Moreover, unlike dpg, it can converge


where $\{(x_i, y_i)\}_{i=1}^{N}$ are the training samples, $w_k$ is
the parameter vector of class k, and $I(\cdot)$ is the indicator
function, which returns 1 when the argument holds and 0 otherwise.
Experiments are performed on the Mnist8m and DNA data sets.

¹ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
² http://largescale.ml.tu-berlin.de/
³ http://cloud.google.com



to the optimal solution and attains a much smaller objective value.
Note that distr-svrg is slow. Since $\tau = 128$, the delayed gradient
can be noisy, and the learning rate
used by distr-svrg (as determined by the validation set) is 
small (10“® vs 10“^ in distr-vr-sgd). On the DNA data set, 
distr-svrg is even slower than petuum-sgd and downpour-sgd (which use
adaptive/decaying learning rates). The vr-dpg, which uses the
variance-reduced gradient, is always faster than dpg. Moreover,
distr-vr-sgd is faster than vr-dpg, showing that replacing $w^t$ in
(2) by $v^t$ in (3) is useful.




(a) Mnist8m. (b) DNA.

Figure 1: Objective vs time (in sec). 

Varying the Number of Workers 

In this experiment, we run distr-vr-sgd with a varying number of
workers (16, 32, 64 and 128) until a target objective value is met.
Figure 2 shows convergence of the objective with time. On the Mnist8m
data set, using 128 workers is about 3 times faster than using 16
workers. On the DNA data set, the speedup is about 6 times.




(a) Mnist8m. (b) DNA.

Figure 2: Objective vs time (in sec), with different numbers of
workers.




(a) Mnist8m. (b) DNA.

Figure 3: Breakdown into computation time and communication time,
with different numbers of workers.


Note that the most expensive step in the algorithm is gradient
evaluation (the scheduler and server operations are simple). Recall
that each stage has m iterations, and each iteration involves $O(B)$
gradient evaluations. At the end of each stage, an additional $O(N)$
gradients are evaluated to obtain the full gradient and monitor the
progress. Hence, each worker spends $O((mB + N)/P)$ time on
computation. The computation time thus decreases linearly with P, as
can be seen from the breakdown of total wall clock time into
computation time and communication time in Figure 3.
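The per-worker computation count can be made concrete with a small calculation. The function below is only a back-of-the-envelope sketch; its name and the numbers passed to it are hypothetical, and the mini-batch size B is held fixed while P varies.

```python
import math


def per_worker_gradient_evals(n_samples, batch_size, n_workers, m=None):
    """Gradient evaluations per worker in one stage: (m * B + N) / P."""
    if m is None:
        # One pass over the data per stage, as in m = ceil(N / B).
        m = math.ceil(n_samples / batch_size)
    return (m * batch_size + n_samples) / n_workers


cost_16 = per_worker_gradient_evals(1000, 10, 16)
cost_32 = per_worker_gradient_evals(1000, 10, 32)
```

Doubling the number of workers halves the per-worker count, which is the linear decrease in computation time described above.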

Moreover, having more workers means more tasks/data can be exchanged
between the server and workers simultaneously, reducing the
communication time. On the other hand, as synchronization is required
among all workers at the end of each stage, having more workers
increases the communication overhead. Hence, as can be seen from
Figure 3, the communication time first decreases with the number of
workers, but then increases as the communication cost in the
synchronization step starts to dominate.

Effect of $\tau$

In this experiment, we use 128 workers. Figure 4 shows the time for
distr-vr-sgd to finish mS tasks (where S = 50 and
$m = \lceil N/B \rceil$) when $\tau$ is varied from 10 to 200. As can
be seen, with increasing $\tau$, higher asynchronicity is allowed, and
the communication cost is reduced significantly.




(a) Mnist8m. (b) DNA.

Figure 4: Breakdown of the total time into computation time and
communication time, with different $\tau$’s.


Conclusion 

Existing distributed asynchronous SGD algorithms often rely on a
decaying learning rate, and thus suffer from a sublinear convergence
rate. On the other hand, the recent delayed proximal gradient
algorithm uses a constant learning rate and has a linear convergence
rate, but can only converge to within a neighborhood of the optimal
solution. In this paper, we proposed a novel distributed asynchronous
SGD algorithm by integrating the merits of the stochastic variance
reduced gradient algorithm and the delayed proximal gradient
algorithm. Using a constant learning rate, it still guarantees
convergence to the optimal solution at a fast linear rate. A prototype
system is implemented and run on the Google Cloud platform.
Experimental results show that the proposed algorithm can reduce the
communication cost significantly with the use of asynchronicity.
Moreover, it converges much faster and yields more accurate solutions
than the state-of-the-art distributed asynchronous SGD algorithms.